Abstract. Computational color constancy is an important topic in computer
vision and has attracted considerable research attention. Recent research
has shown that high-level visual content information benefits illumination
estimation. However, all of the existing methods are essentially
combinational strategies, in which the analysis of image content is used
only to guide the combination of, or selection among, a variety of
individual illumination estimation methods. In this paper, we propose a
novel bilayer sparse coding model for illumination estimation that
considers image similarity in terms of both low-level color distribution
and high-level scene content simultaneously. To this end, the image's scene
content information is integrated with its color distribution to obtain an
optimal illumination estimation model. Experimental results on two
real-world image sets show that our algorithm is superior to other
prevailing illumination estimation methods, including the combinational
methods.



1 Introduction 

The color signals that an imaging device records for an object are
determined by three factors: the color of the light incident on the scene,
the surface reflectance of the object, and the sensor sensitivity function
of the camera [1] [2]. Therefore, the same surface will usually appear to
have different colors under different light sources. In contrast, human
beings are able to "see" a surface as having the same color independent of
variations in the illumination, an ability called "Color Constancy" [3].
Computational color constancy aims to provide the same sort of color
stability in the context of computer vision [3], and its central issue is
to build an optimal model for estimating the illumination color.

Illumination estimation is an ill-posed problem and cannot be solved
without additional assumptions. It has been an active research topic in
both the scientific community and the imaging industry for several decades.
Most early studies treat an image as a bag of pixels with RGB values and
derive the illumination estimation model without considering the underlying
semantic content expressed by the arrangement of the pixels. We call these
methods "Data Driven Estimation Methods" (DD). Some of them, such as Grey
World (GW) [5], maxRGB [6], Shades of Grey (SoG) [7], and the Edge-based
method [8], predefine fixed models based on certain hypotheses for all
images, while others, including Color-by-Correlation (C-by-C) [5], the
Neural Network based method (NN) [10], the Support Vector Regression based
method (SVR) [11], and Gamut Mapping [12], learn the optimal estimation
model in the 2D/3D chromaticity space through a training procedure. Once
the model is fixed in a DD method, the light colors of all test images are
computed using the same model. Therefore, these methods are effective only
when the distribution of colors within the image fits the assumed model.

Recent years have witnessed a rise in the use of image content analysis for
illumination estimation. We call these methods "Content Driven Estimation
Methods" (CD). So far, all existing CD methods are essentially
combinational methods [13] that generally contain two steps: (1) applying
several individual estimation models (rather than only one) to the same
image, and (2) selecting the best estimate or combining the outputs based
on the image's content characteristics. Previous efforts in this area
include the work of Gijsenij [14], which selects the most appropriate
estimation method based on natural image texture statistics and scene
semantics (NIS). Lu et al. [13] use a 3D stage geometry model (SG) to
divide images into different geometrical regions and select appropriate
estimations per depth layer or geometrical section. Bianco et al. [16]
propose to use indoor-outdoor scene classification to choose the most
appropriate estimation method. Van de Weijer et al. [17] model an image as
a mixture of semantic classes, such as sky, grass, road, and building,
evaluate several different illumination estimates by the likelihood of the
resulting semantic content given prior knowledge of the world, and output
the estimate that yields the most likely semantic composition of the image.
Although these methods have shown that high-level scene content is useful
for illumination estimation [13], they inevitably depend on the selected
individual methods and on automatic visual content analysis. The latter,
such as 3D stage classification or indoor-outdoor classification, is itself
another difficult computer vision problem.

In this paper, we model illumination estimation as an image similarity
problem and propose a novel bilayer sparse coding model that considers
low-level color distribution and high-level scene category simultaneously.
Our work is primarily inspired by two hypotheses: (1) images with similar
color distributions are likely to have been captured under similar light
colors; and (2) scenes belonging to the same high-level category have
similar illumination conditions [16]. The latter holds because the range of
light colors in a given type of scene is often limited. For example, indoor
lights tend to be reddish, while outdoor lights are mostly bluish. The
first hypothesis has been validated by many DD methods that train the
estimation model on color chromaticity histograms. The second has also been
shown to be effective in some CD methods [2] [15] [16]. The approach
described here is similar to that of Bianco et al. [16], but we do not
explicitly classify the scene into predefined scene categories and then use
the classification output to guide the selection or combination of
illumination estimation candidates. Instead, we integrate high-level scene
category analysis into the illumination estimation procedure itself, so as
to avoid the negative impact of false scene classification. Another
contribution of the proposed method is that, unlike DD methods that always
use one prefixed (or learned) model for all test images, our model sparsely
represents each test image by adaptively selecting training samples
according to color and scene cues. Experiments on two real-world image sets
show that our method outperforms other prevailing illumination estimation
methods, including the CD combinational methods.

The remainder of this paper is organized as follows. In Section 2, we
briefly introduce the sparse coding technique. The details of the proposed
method are presented in Section 3. Experimental results and further
analysis are given in Section 4. Section 5 concludes the paper.

2 Sparse Coding Preliminaries 

Before introducing the details of our model, we give a brief overview of
sparse coding, which is the basis of the proposed algorithm. Recently, much
interest has been shown in computing linear sparse representations with
respect to an overcomplete dictionary of basis elements. The goal of sparse
coding is to represent an input vector approximately as a weighted linear
combination of a small number of "basis vectors". Given an input vector
$x \in \mathbb{R}^k$ and basis vectors
$U = [u_1, u_2, \ldots, u_n] \in \mathbb{R}^{k \times n}$, sparse coding
aims to find a sparse vector of coefficients $a \in \mathbb{R}^n$ such that
$x \approx Ua = \sum_j u_j a_j$. This is equivalent to solving the
following objective:

$$\min_{a} \; \|x - Ua\|_2^2 + \lambda \|a\|_0, \tag{1}$$

where $\|a\|_0$ denotes the $\ell_0$-norm, which counts the number of
nonzero entries in the vector $a$. It is well known that finding the
sparsest representation is NP-hard in the general case. Fortunately, recent
results [18] show that, if the solution is sparse enough, the sparse
representation can be recovered by the following convex $\ell_1$-norm
minimization [18]:

$$\min_{a} \; \|x - Ua\|_2^2 + \lambda \|a\|_1, \tag{2}$$

where the first term of Eq. (2) is the reconstruction error and the second
term controls the sparsity of the coefficient vector $a$ through the
$\ell_1$-norm. $\lambda$ is the regularization coefficient that controls
the sparsity of $a$: a larger $\lambda$ yields a sparser solution. Sparse
coding based on the $\ell_1$-norm has been widely applied in practice,
including face recognition, image classification, etc. [18].
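
As an illustration of the $\ell_1$ problem in Eq. (2), the following is a minimal sketch that uses the iterative shrinkage-thresholding algorithm (ISTA). ISTA is chosen here only for brevity and is not the solver used in this paper; the function names, the random dictionary, and the values of `lam` and `n_iter` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_sparse_code(x, U, lam=0.1, n_iter=500):
    """Approximately minimise ||x - U a||_2^2 + lam * ||a||_1 with ISTA."""
    # The gradient 2 U^T (U a - x) has Lipschitz constant 2 * sigma_max(U)^2.
    lipschitz = 2.0 * np.linalg.norm(U, 2) ** 2
    a = np.zeros(U.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * U.T @ (U @ a - x)
        a = soft_threshold(a - grad / lipschitz, lam / lipschitz)
    return a

# Hypothetical data: an overcomplete dictionary with 50 basis vectors in R^20.
rng = np.random.default_rng(0)
U = rng.standard_normal((20, 50))
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
x = U @ a_true

a_hat = ista_sparse_code(x, U, lam=0.1)
print("nonzero coefficients:", int(np.sum(np.abs(a_hat) > 1e-6)))
```

Increasing `lam` in this sketch raises the shrinkage threshold, so more coefficients are driven to exactly zero, which matches the role of $\lambda$ described above.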

3 Bilayer Sparse Coding for Illumination Estimation 

In this section, we first propose the bilayer sparse coding model (BSC) for
illumination estimation, then discuss the color and scene features used in
BSC, and finally give an optimization algorithm for BSC.

3.1 Bilayer Sparse Coding Model (BSC)

Sparse Coding for Color Similarity Analysis. Given $N$ training images
$I_1, I_2, \ldots, I_N$, the color feature vector of image $I_i$ is
$C_i \in \mathbb{R}^d$. Here, the color feature $C_i$ can be a binarized
2D/3D chromaticity histogram, which has proved effective for many
supervised color constancy algorithms [9] [10] [11]. For any test image
$I_y$ with color feature $C_y \in \mathbb{R}^d$, we can linearly
reconstruct its color feature from the training images under the sparse
coding framework:

$$\min_{\gamma} \; \|C_y - C\gamma\|_2^2 + \lambda \|\gamma\|_1, \tag{3}$$

where $C = [C_1, C_2, \ldots, C_N]$ and
$\gamma = [\gamma_1, \gamma_2, \ldots, \gamma_N]^T$ is an $N$-dimensional
coefficient vector that gives the reconstruction weight associated with
each training image. From the viewpoint of color gamut, Eq. (3)
reconstructs the color gamut of the test image from the color gamuts of all
the training images. The sparse code $\gamma$ can also be viewed as the
color correlation coefficient between $I_y$ and each training image.

Sparse Coding for Scene Category Similarity Analysis. Besides the color
(or chromaticity) distribution, high-level scene category cues embedded in
the image can also improve the accuracy of illumination estimation [16].
Generally speaking, a type of scene is determined by a bag of certain
objects and their co-occurrence relationships [19]. For example, a 'street'
scene often contains roads and buildings.

To model the appearance of different objects in the scene, we segment each
training image $I_i$ into $n_i$ objects, denoted as
$I_i^1, I_i^2, \ldots, I_i^{n_i}$, so that the training image set contains
$n_1 + n_2 + \cdots + n_N$ objects in all. Each object $I_i^j$ is
represented by a visual vocabulary histogram $v_i^j \in \mathbb{R}^m$
obtained from the Bag-of-Words model (BoW) [20], and all the objects in
$I_i$ are denoted as
$V_i = [v_i^1, v_i^2, \ldots, v_i^{n_i}] \in \mathbb{R}^{m \times n_i}$.
The test image $I_y$ is also segmented into $n_y$ objects
$I_y^1, I_y^2, \ldots, I_y^{n_y}$, whose vocabulary histograms are
$v_y^1, v_y^2, \ldots, v_y^{n_y} \in \mathbb{R}^m$. The scene category
similarity analysis here reconstructs the $n_y$ objects in the test image
from the $n_1 + n_2 + \cdots + n_N$ objects in the training images, as
shown in Fig. 1.

Considering the co-occurrence property of objects in the same image, we
should try to reconstruct the objects in the test image using objects that
come from the same training image. Therefore, we introduce the multi-task
joint sparse representation based on the $\ell_{1,2}$ norm [21]. The
multi-task joint sparse representation can be regarded as a combination of
group Lasso and multi-task Lasso: it penalizes the sum of the $\ell_2$
norms of the blocks of coefficients associated with each covariate group
(the objects in each training image) across the different reconstruction
tasks (the reconstruction of each object in the test image) [21].

Fig. 1. Sparse reconstruction of an image's scene content: (A) test image
$I_y$ and its segmented objects. (B) Training image $I_1$ and its segmented
objects; $W_1^j$ $(j = 1, 2, \ldots, n_y)$ is the reconstruction
coefficient vector of the $j$-th object in $I_y$ associated with all the
objects in $I_1$. (C) Training image $I_N$ and its segmented objects;
$W_N^j$ $(j = 1, 2, \ldots, n_y)$ is the reconstruction coefficient vector
of the $j$-th object in $I_y$ associated with all the objects in $I_N$.

For any test object $I_y^j$ in the test image $I_y$, let
$W_i^j \in \mathbb{R}^{n_i}$ denote the reconstruction coefficients
associated with the objects $I_i^1, I_i^2, \ldots, I_i^{n_i}$ in the image
$I_i$. We can then use
$W_i = [W_i^1, W_i^2, \ldots, W_i^{n_y}] \in \mathbb{R}^{n_i \times n_y}$
to represent the reconstruction coefficient matrix of all the objects in
$I_y$ associated with all the objects in the image $I_i$. The
correspondence between objects and coefficients is shown in Fig. 1. The
joint sparse representation of all the objects in the test image can be
formulated as [21]:



$$\min_{W} \; \sum_{j=1}^{n_y} \Big\| v_y^j - \sum_{i=1}^{N} V_i W_i^j \Big\|_2^2 + \beta \sum_{i=1}^{N} \|W_i\|_2, \tag{4}$$



where $W = [W_1, W_2, \ldots, W_N]^T$ is the sparse coefficient matrix for
all the objects in the test image and $\beta$ is the regularization
coefficient. The optimization problem in Eq. (4), which is known as
multi-task joint covariate selection in Lasso-related research, can be
solved effectively by the $\ell_{1,2}$ mixed-norm Accelerated Proximal
Gradient (APG) algorithm proposed by Yuan et al. [21].
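
To make the group structure of Eq. (4) concrete, the sketch below solves it with a plain (non-accelerated) proximal gradient loop whose proximal step shrinks each per-image block $W_i$ as a whole. This is a simplified stand-in for the accelerated APG solver of [21], not a reproduction of it; the function names, the random data, and the values of `beta` and `n_iter` are assumptions.

```python
import numpy as np

def block_soft_threshold(W, groups, t):
    """Shrink each row block W[g] by t in Frobenius norm (zero if norm <= t)."""
    out = np.zeros_like(W)
    for g in groups:
        norm = np.linalg.norm(W[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * W[g]
    return out

def joint_sparse_code(V_y, V_list, beta=0.1, n_iter=300):
    """Proximal gradient for ||V_y - V W||_F^2 + beta * sum_i ||W_i||_F."""
    V = np.hstack(V_list)                      # m x (n_1 + ... + n_N)
    groups, start = [], 0
    for V_i in V_list:                         # one row block of W per image
        groups.append(slice(start, start + V_i.shape[1]))
        start += V_i.shape[1]
    lipschitz = 2.0 * np.linalg.norm(V, 2) ** 2
    W = np.zeros((V.shape[1], V_y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * V.T @ (V @ W - V_y)
        W = block_soft_threshold(W - grad / lipschitz, groups, beta / lipschitz)
    return W, groups

# Hypothetical data: 3 training images with 4, 3 and 5 objects, m = 10 words.
rng = np.random.default_rng(1)
V_list = [rng.random((10, n)) for n in (4, 3, 5)]
V_y = V_list[1] @ rng.random((3, 2))           # 2 test objects built from image 2
W, groups = joint_sparse_code(V_y, V_list, beta=0.5)
print([round(float(np.linalg.norm(W[g])), 3) for g in groups])
```

Because the penalty acts on whole blocks, a training image is either used (all of its objects may contribute) or dropped entirely, which is the co-occurrence behaviour motivated above.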



Bilayer Sparse Coding for Illumination Estimation (BSC). In order to
integrate scene category information into the illumination estimation
model, a bilayer sparse coding model is formulated that includes similarity
analysis on both color distribution and scene category:

Color Layer:

$$\min_{\gamma} \; \|C_y - C\gamma\|_2^2 + \lambda \|D\gamma\|_1, \qquad D = \mathrm{diag}\big(f(\|W_1\|_2), f(\|W_2\|_2), \ldots, f(\|W_N\|_2)\big), \tag{5}$$

Scene Layer:

$$\min_{W} \; \sum_{j=1}^{n_y} \Big\| v_y^j - \sum_{i=1}^{N} V_i W_i^j \Big\|_2^2 + \beta \sum_{i=1}^{N} g(\|\gamma_i\|_1)\, \|W_i\|_2, \tag{6}$$

where

$$f(\|W_i\|_2) = \frac{\max_{k=1..N} \|W_k\|_2 - \|W_i\|_2}{\max_{k=1..N} \|W_k\|_2 - \min_{k=1..N} \|W_k\|_2}, \tag{7}$$

$$g(\|\gamma_i\|_1) = \frac{\max_{k=1..N} \|\gamma_k\|_1 - \|\gamma_i\|_1}{\max_{k=1..N} \|\gamma_k\|_1 - \min_{k=1..N} \|\gamma_k\|_1}. \tag{8}$$



From the formulation above, we can see that $f(\|W_i\|_2)$ and
$g(\|\gamma_i\|_1)$ are monotonically decreasing functions. Their outputs
lie in $[0, 1]$ and can be viewed as the costs in sparse color
reconstruction and sparse scene content reconstruction, respectively. The
color layer tends to select the images with lower $f(\|W_i\|_2)$ values,
which correspond to a higher $\ell_{1,2}$ norm $\|W_i\|_2$ of the scene
reconstruction coefficient $W_i$, to reconstruct the test image's color
feature. Similarly, the scene layer tends to select the images with lower
$g(\|\gamma_i\|_1)$, which correspond to a higher $\ell_1$ norm
$\|\gamma_i\|_1$ of the color reconstruction coefficient $\gamma_i$, to
reconstruct the test image's scene content. Comparing Eq. (5) with Eq. (3)
shows that $\gamma$ in the BSC model contains not only color correlation
but also scene content correlation information. The optimization of the BSC
model is discussed in Section 3.3.
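
The cost functions in Eqs. (7) and (8) share the same max-min normalisation, so a single helper can sketch both. The name `decreasing_cost`, the example values, and the handling of the degenerate case in which all inputs are equal (returning a uniform cost of 1, which mirrors the initial $D = \mathrm{diag}(1, \ldots, 1)$) are assumptions, since the paper does not specify that case.

```python
import numpy as np

def decreasing_cost(values, eps=1e-12):
    """Map values to [0, 1]: the largest value gets cost 0, the smallest cost 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span < eps:
        # Assumption: with no spread, every image receives the same cost of 1.
        return np.ones_like(values)
    return (values.max() - values) / span

# Hypothetical block norms ||W_i||_2 for three training images, as in Eq. (7).
scene_norms = [0.0, 1.0, 4.0]
f = decreasing_cost(scene_norms)

# Hypothetical color coefficients gamma_i, as in Eq. (8); |gamma_i| = ||gamma_i||_1.
gamma = np.array([0.6, 0.0, 0.3])
g = decreasing_cost(np.abs(gamma))
print(f, g)
```

An image whose objects contribute strongly to the scene reconstruction therefore receives a small weight in $D$ and is penalised less in the color layer, and vice versa.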



Illumination Estimation. The coefficient $\gamma$ in Eq. (5), which
represents the correlation between the test image and all training images,
is used for illumination estimation. To remove the shading effect, the
ground-truth illumination color $(R_i, G_i, B_i)^T$ of the training image
$I_i$ is mapped into the 2D chromaticity space through

$$l_i = (r_i, g_i)^T = \left( \frac{R_i}{R_i + G_i + B_i}, \; \frac{G_i}{R_i + G_i + B_i} \right)^T.$$

The coefficient vector $\gamma$ is also normalized by its $\ell_1$ norm:
$\hat{\gamma} = \gamma / \|\gamma\|_1$. The final illumination chromaticity
$l_y = (r_y, g_y)^T$ of the test image can then be estimated as the
weighted average of the illumination values of all the training images:

$$l_y = L\hat{\gamma}, \qquad L = [l_1, l_2, \ldots, l_N]. \tag{9}$$
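
The estimation step in Eq. (9) can be sketched in a few lines. The function name and the example illuminants and coefficients are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def estimate_chromaticity(gamma, illum_rgb):
    """Estimate l_y = L @ gamma_hat from training illuminants (N x 3 RGB)."""
    illum_rgb = np.asarray(illum_rgb, dtype=float)
    # Map each ground-truth illuminant to (r, g) chromaticity.
    chroma = illum_rgb[:, :2] / illum_rgb.sum(axis=1, keepdims=True)   # N x 2
    gamma = np.asarray(gamma, dtype=float)
    gamma_hat = gamma / np.abs(gamma).sum()        # l1 normalisation
    return chroma.T @ gamma_hat                    # (r_y, g_y)

# Hypothetical example: three training illuminants, two of them selected.
illum_rgb = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
gamma = [0.5, 0.0, 0.5]
l_y = estimate_chromaticity(gamma, illum_rgb)
print(l_y)
```

Here the estimate is the midpoint of the first and third training chromaticities, because only those two images carry nonzero weight.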



3.2 Feature Extraction 

This section discusses feature extraction in the BSC model. In the color
reconstruction layer, we use the 3D color space of [11]: two chromaticity
values, defined as
$(r, g)^T = \left( \frac{R}{R+G+B}, \frac{G}{R+G+B} \right)^T$, and one
intensity value, defined as $L = R + G + B$. The chromaticity space
$(r, g)^T$ is partitioned along each component into 50 equal parts, which
yields 2500 bins. The intensity $L$ is quantized into 25 equal steps
[1] [22], so the 3D color histogram consists of 62,500
($50 \times 50 \times 25$) bins [11]. Each image is represented as a
binarized 3D chromaticity histogram, in which '1' or '0' indicates the
presence or absence of the corresponding chromaticity and intensity in the
image. Since $0 \le r + g \le 1$, a compact 3D chromaticity histogram can
be obtained by discarding the part of the space with $r + g > 1$.

In the scene layer, the SIFT descriptor [23], which is widely applied in
scene classification, is used as the visual feature of an object under the
Bag-of-Words (BoW) model [20]. To avoid interdependence between the two
layers, the grey-scale SIFT descriptor is used. Dense SIFT descriptors are
extracted with an 8-pixel step for each image. All the SIFT descriptors in
the training images are then clustered into $m$ visual words via K-Means.
Finally, each segmented region, with the SIFT descriptors that fall inside
it, is represented as an $m$-dimensional visual word histogram in
$\mathbb{R}^m$ via the BoW model.
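
The binarized color feature described above can be sketched as follows. The sketch assumes RGB values scaled to $[0, 1]$, so that the intensity $R + G + B$ lies in $[0, 3]$; the parameter `max_intensity`, the function name, and the choice to skip black pixels are assumptions, and the compaction step that discards $r + g > 1$ is omitted.

```python
import numpy as np

def binarized_color_histogram(rgb, n_chroma=50, n_intensity=25, max_intensity=3.0):
    """Return a flat 0/1 vector marking occupied (r, g, L) bins."""
    rgb = np.asarray(rgb, dtype=float).reshape(-1, 3)
    total = rgb.sum(axis=1)
    keep = total > 0                       # assumption: black pixels are skipped
    rgb, total = rgb[keep], total[keep]
    r = rgb[:, 0] / total
    g = rgb[:, 1] / total
    # Quantise each axis into equal-width bins; clamp the upper edge.
    r_idx = np.minimum((r * n_chroma).astype(int), n_chroma - 1)
    g_idx = np.minimum((g * n_chroma).astype(int), n_chroma - 1)
    l_idx = np.minimum((total / max_intensity * n_intensity).astype(int),
                       n_intensity - 1)
    hist = np.zeros((n_chroma, n_chroma, n_intensity), dtype=np.uint8)
    hist[r_idx, g_idx, l_idx] = 1          # presence only, not counts
    return hist.ravel()

# Hypothetical pixels: two identical reds, one grey, one black.
pixels = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.2, 0.2, 0.2], [0.0, 0.0, 0.0]]
feature = binarized_color_histogram(pixels)
print(feature.size, int(feature.sum()))
```

Because the histogram records presence rather than counts, the two identical red pixels occupy a single bin.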



Table 1. Pseudo-code for bilayer sparse coding optimization.

Algorithm 1: Optimization of bilayer sparse coding

1: Input: the color feature $C_i$ and scene feature
   $V_i = [v_i^1, v_i^2, \ldots, v_i^{n_i}] \in \mathbb{R}^{m \times n_i}$
   of each training image; the color feature $C_y$ and scene feature
   $V_y = [v_y^1, v_y^2, \ldots, v_y^{n_y}]$ of the test image; the
   regularization coefficients $\lambda$ and $\beta$; the threshold
   $\epsilon$.

2: Initialize $D = \mathrm{diag}(1, 1, \ldots, 1)$ and solve $\gamma$ in
   Eq. (5) via the $\ell_{1,2}$ mixed-norm APG algorithm.

3: Repeat
     Set $p = \gamma$.
     For i = 1 : N Do
       Compute $g(\|\gamma_i\|_1)$.
     End
     Solve $W$ in Eq. (6) with $g(\|\gamma_i\|_1)$ via the $\ell_{1,2}$
     mixed-norm APG algorithm.
     For i = 1 : N Do
       Compute $f(\|W_i\|_2)$.
     End
     Solve $\gamma$ in Eq. (5) with $f(\|W_i\|_2)$ via the $\ell_{1,2}$
     mixed-norm APG algorithm.
   Until ($\|p - \gamma\| < \epsilon$ or the maximum number of iterations
   is reached).

4: Output: $\gamma$ and $W$.
3.3 Optimization for Bilayer Sparse Coding Model

The joint optimization of Eq. (5) and Eq. (6) is not straightforward.
However, if the value of $\gamma$ is fixed, the optimization in the scene
layer is just a multi-task joint sparse coding problem, which can be solved
effectively via the $\ell_{1,2}$ mixed-norm Accelerated Proximal Gradient
(APG) algorithm [21]. On the other hand, if the coefficient matrix $W$ is
given, the optimization in the color layer is just a general sparse coding
problem with a cost constraint, which can also be solved by the
$\ell_{1,2}$ mixed-norm APG. Consequently, we give an approximate iterative
$\ell_{1,2}$ mixed-norm APG algorithm to optimize the bilayer sparse
coding, as shown in Table 1.

At each iteration, new values of $\gamma$ and $W$ are obtained for the next
iteration. The stopping condition is $\|p - \gamma\| < \epsilon$, where
$\|p - \gamma\|$ is the distance between successive solutions of $\gamma$.
The threshold $\epsilon$ is fixed at 0.05 in this paper.
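
The control flow of this alternating scheme can be sketched independently of the inner solvers. In the sketch below, `solve_color` and `solve_scene` are hypothetical callables standing in for the APG solves of Eq. (5) and Eq. (6); the toy stand-ins at the bottom are not real solvers and exist only so that the loop can be run. The uniform cost returned when all inputs are equal is also an assumption.

```python
import numpy as np

def decreasing_cost(values, eps=1e-12):
    """Max-min normalised decreasing cost in [0, 1], as in Eqs. (7) and (8)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span < eps:
        return np.ones_like(values)        # assumption for the degenerate case
    return (values.max() - values) / span

def bsc_alternate(solve_color, solve_scene, n_train, eps=0.05, max_iter=20):
    """Alternate the color and scene layers until gamma stabilises.

    solve_color(d) -> gamma of shape (N,), given the diagonal d of D.
    solve_scene(g) -> list of N blocks W_i, given the group weights g.
    """
    gamma = solve_color(np.ones(n_train))              # D = identity
    W = None
    for _ in range(max_iter):
        previous = gamma
        g = decreasing_cost(np.abs(gamma))             # costs for the scene layer
        W = solve_scene(g)
        f = decreasing_cost([np.linalg.norm(W_i) for W_i in W])
        gamma = solve_color(f)                         # costs for the color layer
        if np.linalg.norm(previous - gamma) < eps:
            break
    return gamma, W

# Toy stand-ins (NOT real solvers): lower cost leads to a larger coefficient.
base = np.array([3.0, 2.0, 1.0])
toy_color = lambda d: base / (1.0 + d)
toy_scene = lambda g: [np.array([[1.0 - g_i]]) for g_i in g]

gamma, W = bsc_alternate(toy_color, toy_scene, n_train=3)
print(np.round(gamma, 3))
```

Passing the solvers in as arguments keeps the sketch short and makes the stopping rule and the exchange of costs between the two layers easy to inspect.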

4 Experiments 

We evaluate the proposed BSC algorithm on two real-world image sets. The
first was provided by Gehler et al. [24] [25] and subsequently reprocessed
by Shi et al. [26] (denoted as the Gehler-Shi set); the second consists of
real-world images captured from digital video, provided by Ciurea et al.
[27] (denoted as the SFU set). The BSC method is compared with leading DD
methods, including GW [5], maxRGB [6], SoG [7], Grey Edge ($e^{0,13,2}$,
$e^{1,1,6}$, $e^{2,1,5}$) [8], NN [10], and SVR [11], and with CD
combinational methods, namely NIS [14], SG [15], and IO [16]. The parameter
settings for NN and SVR follow [10] and [11], respectively. The parameters
of NIS, IO, and SG are the same as those used in [13]. The binarized 3D
color histogram is also used in the SVR method. Three parameters of the BSC
algorithm need to be set in advance: the regularization coefficients
$\lambda$ and $\beta$, and the number of visual words $m$ in BoW. To
simplify parameter selection, we fix $\lambda = \beta = 0.1$ and set the
number of visual words $m$ to 200, 400, and 800 in turn. The JSEG algorithm
[28] is used to segment the objects in each image because of its
flexibility in adjusting the number of regions. After segmentation, regions
whose areas are less than 1/20 of the whole image are removed. To further
validate the effect of the scene category on illumination estimation, the
single color layer of BSC without any scene cue (denoted as SSC) is also
included in the comparison. The matrix $D$ in SSC is always fixed as
$D = \mathrm{diag}(1, 1, \ldots, 1)$.

4.1 Error Measurement 

The choice of error measure is one of the most important issues in the
experiments. For each image in the image sets, the ground-truth
chromaticity of the light source $e_a = (r_a, g_a, b_a)^T$ is known. To
measure how closely the estimated illumination resembles the true color of
the light source, we adopt the angular error, which is the angular distance
between the estimated illumination chromaticity
$e_y = (r_y, g_y, 1 - r_y - g_y)^T$ and the ground-truth chromaticity
$e_a$. The angular error function $\mathrm{angular}(e_y, e_a)$ is defined
as

$$\mathrm{angular}(e_y, e_a) = \cos^{-1}\left( \frac{e_y \cdot e_a}{\|e_y\| \, \|e_a\|} \right), \tag{10}$$

where $e_y \cdot e_a$ is the dot product of $e_y$ and $e_a$, and
$\|\cdot\|$ denotes the Euclidean norm. To measure the performance of an
algorithm on a data set, the median angular error is used. In addition, to
provide more insight into the complete distribution of errors on an image
set, we also compute the trimean. The trimean, introduced by Gijsenij et
al. [29], is the weighted average of the first, second, and third quartiles
$Q_1$, $Q_2$, and $Q_3$:

$$\mathrm{Trimean} = \frac{Q_1 + 2Q_2 + Q_3}{4}. \tag{11}$$
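
Both error measures in Eqs. (10) and (11) are straightforward to compute. The sketch below reports the angle in degrees and takes the quartiles from `numpy.percentile` with its default linear interpolation; other quartile conventions exist and may give slightly different trimean values, so the interpolation choice is an assumption. The function names and example inputs are hypothetical.

```python
import numpy as np

def angular_error_deg(e_est, e_true):
    """Angle in degrees between estimated and true illuminant vectors, Eq. (10)."""
    e_est = np.asarray(e_est, dtype=float)
    e_true = np.asarray(e_true, dtype=float)
    cos = e_est @ e_true / (np.linalg.norm(e_est) * np.linalg.norm(e_true))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def trimean(errors):
    """Weighted average of the quartiles, Eq. (11)."""
    q1, q2, q3 = np.percentile(errors, [25, 50, 75])
    return float((q1 + 2.0 * q2 + q3) / 4.0)

# Hypothetical estimate (r_y, g_y) compared with a white ground truth.
r_y, g_y = 0.40, 0.35
e_y = [r_y, g_y, 1.0 - r_y - g_y]
e_a = [1 / 3, 1 / 3, 1 / 3]
print(angular_error_deg(e_y, e_a))
print(trimean([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The angular error depends only on the direction of the two vectors, so scaling either illuminant leaves the result unchanged.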



4.2 Experimental Results on the Gehler-Shi Set 

The Gehler-Shi image set contains 568 images taken with two high-quality
DSLR cameras (Canon 5D and Canon 1D) and includes a wide variety of indoor
and outdoor shots. All the images were saved in Canon RAW format. Because
the TIFF images provided by Gehler et al. [24] in this set were produced
automatically, they contain clipped pixels, are non-linear (i.e., have
gamma or tone curve correction applied), and include the effect of the
camera's white balancing. To avoid these problems, Shi et al. [26]
reprocessed the raw data and created almost-raw 12-bit Portable Network
Graphics (PNG) images. The results are 2041 x 1359 (Canon 1D) or
2193 x 1460 (Canon 5D) linear images (gamma = 1) in camera RGB space.
Consequently, the reprocessed Gehler-Shi set is used in the following
experiments.

Although the correlations among images in this set are much lower [21],
3-fold cross-validation is still used so that no image is tested with a
model trained on it. Because the images in this set are named in the order
in which they were taken, and neighboring images in the sequence are more
likely than others to show similar scenes, we order them by filename and
divide them into 3 consecutive subsets. The first two subsets each include
189 images and the remaining one includes 190 images. In each run, one
subset is picked as the test set and the other two are used as the training
set. This procedure is repeated 3 times with a different test set each
time, and the overall performance is reported as the final result. The
final experimental results are shown in Table 2.
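
The sequential 3-fold split described above can be sketched as follows. The file names are hypothetical placeholders, and assigning the one leftover image to the last fold is an assumption chosen to reproduce the 189/189/190 subset sizes.

```python
def sequential_folds(names, n_folds=3):
    """Split names, sorted by filename, into consecutive folds.

    Returns a list of (train, test) pairs, one per fold.
    """
    ordered = sorted(names)
    base, extra = divmod(len(ordered), n_folds)
    # Leftover items go to the last folds, giving 189, 189, 190 for 568 images.
    sizes = [base + (1 if i >= n_folds - extra else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(ordered[start:start + size])
        start += size
    splits = []
    for i, test in enumerate(folds):
        train = [name for j, fold in enumerate(folds) if j != i for name in fold]
        splits.append((train, test))
    return splits

# Hypothetical file names standing in for the 568 images.
names = [f"img_{i:04d}.png" for i in range(568)]
splits = sequential_folds(names)
print([len(test) for _, test in splits])
```

Keeping each fold contiguous in capture order means that near-duplicate neighboring shots mostly stay within the same fold, apart from those at the two fold boundaries.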

Table 2. Comparison of the performance of all competing methods on the
Gehler-Shi image set. An asterisk marks the minimum in each category. The
Do Nothing method always estimates the illuminant as being white
(r = g = b). Its performance is a measure of the amount of variation in the
illumination color occurring within the image set.

Category          Algorithm        Median   Trimean
                  Do Nothing       4.80     7.53
Data Driven       GW               3.70     4.09
                  SoG              4.59     6.01
                  maxRGB           9.18     9.77
                  $e^{0,13,2}$     4.72     5.88
                  $e^{1,1,6}$      3.63     4.01
                  $e^{2,1,5}$      3.71     4.09
                  NN               4.52     5.80
                  SVR              2.76*    3.13*
Content Driven    NIS              2.26     2.89
                  IO               2.23*    2.83*
                  SG               2.69     3.18
Proposed          BSC (m = 200)    2.49     3.11
                  BSC (m = 400)    2.28     2.90
                  BSC (m = 800)    2.06*    2.71*
                  SSC              2.83     3.29

The results in Table 2 show that, among the existing methods compared with
the proposed method, SVR and IO obtain the best performance in their
respective categories. Our proposed method is better than SVR for every
value of m. In particular, the BSC method with m = 800 achieves the best
performance: its median and trimean errors are 2.06 and 2.71, which also
outperform the IO method. The BSC methods with m = 200 and m = 400 are
comparable to the IO method. However, we should emphasize that IO is a
combinational method [13], while the proposed BSC is an individual method.
In addition, the BSC method performs much better than the SSC method, which
implies that high-level scene category cues can indeed improve illumination
estimation. Furthermore, SSC performs similarly to SVR, which shows that
the sparse coding technique is effective for illumination prediction.

4.3 Experimental Results on the SFU Set 

The second image set, introduced by Ciurea et al. [27], consists of more
than 11,000 frames from videos. Since the images in this set are extracted
from videos, nearby images are highly correlated, and re-sampling is
necessary for an objective evaluation. To this end, Bianco et al. [16]
apply a video-based analysis to select images with reduced correlation.
Frames that are redundant in terms of visual content are removed and only
the most representative ones are retained. This procedure yields 1,135
images with much lower correlations. A matte grey ball is mounted onto the
video camera to obtain the ground-truth illumination of each image. To
ensure that the grey ball has no effect on our results, all images are
cropped on the right to remove it. The remaining images are 240 x 240
pixels.

Since the SFU set contains 15 subcategories whose images were taken in
different places, 15-fold cross-validation is adopted here, as in Gijsenij
et al. [14]. To ensure that the training and testing subsets are truly
distinct, Bianco's resampled set is partitioned into 15 subsets based on
geographical location. One subset is then used for testing and the other 14
are used for training. This procedure is repeated 15 times. The overall
performance is shown in Table 3.

Table 3. Comparison of the performance of all competing methods on the SFU
image set. An asterisk marks the minimum in each category. The Do Nothing
method always estimates the illuminant as being white (r = g = b). Its
performance is a measure of the amount of variation in the illumination
color occurring within the image set.

Category          Algorithm        Median   Trimean
                  Do Nothing       6.84     7.40
Data Driven       GW               6.64     7.06
                  SoG              6.19     5.64
                  maxRGB           5.26     6.06
                  $e^{0,13,2}$     5.55     5.80
                  $e^{1,1,6}$      5.10     5.46
                  $e^{2,1,5}$      5.24     5.50
                  NN               4.86     5.20
                  SVR              4.38*    5.04*
Content Driven    NIS              4.83     5.66
                  IO               4.22*    4.96*
                  SG               4.97     5.46
Proposed          BSC (m = 200)    3.90     4.28
                  BSC (m = 400)    3.78*    4.27*
                  BSC (m = 800)    3.95     4.31
                  SSC              4.57     5.20

As in the previous experiment, SVR and IO still perform best in their
respective categories, and similar conclusions can be drawn. The proposed
BSC methods, whose median errors are below 4.00 for every value of m,
outperform all other methods, including the content-driven combinational
method IO. The fact that all the BSC methods outperform SSC further
confirms the effect of scene category cues on illumination estimation.

Finally, we illustrate the potential benefits of high-level scene content
analysis in illumination estimation with two examples. The image shown in
Fig. 2(A) was captured indoors. If we consider only the color distribution
via SSC, the three images with the highest $\gamma_i$ values are those
shown in Fig. 2(A-1) to (A-3). Two of them (Fig. 2(A-2) and (A-3)) come
from different scene categories and bias the final illumination estimate.
By contrast, if we add high-level scene content analysis via BSC, the three
images with the highest $\gamma_i$ values all come from similar indoor
scenes, and the final estimate is much closer to the real value. The second
example, in Fig. 2(B), is an outdoor scene with a road. If the image scene
content is considered using BSC, the three most correlated images are all
outdoor scenes that also contain roads, and their contributions result in
an accurate illumination estimate. These examples further indicate that the
constraint of high-level scene category in BSC can indeed improve
illumination estimation.

Fig. 2. Comparison between BSC (m = 400) and SSC. (A), (B): test images.
(A-1)-(A-3) and (B-1)-(B-3): images with the three highest $\gamma_i$
values associated with the corresponding test images in SSC. (A-5)-(A-7)
and (B-5)-(B-7): images with the three highest $\gamma_i$ values associated
with the corresponding test images in BSC. (A-4) and (B-4): corrected
images using the estimates of SSC. (A-8) and (B-8): corrected images using
the estimates of BSC. The angular errors (in degrees) shown in the
corrected images indicate the accuracy of the estimated illumination.



5 Conclusion 



High-level image content cues have been shown to help improve illumination
estimation for color constancy. However, nearly all the prevailing methods
that use high-level content cues can be viewed as combinational methods
with two separate steps: applying several individual estimation models, and
combining their outputs by analyzing the test image's scene content. In
this paper, we integrate the image's color distribution and scene analysis
into a unified framework and propose a novel bilayer sparse coding model
for the illumination estimation problem. Experiments on two real-world
image sets show that the mutually constrained combination can significantly
improve the accuracy of illumination estimation, and that the proposed
algorithm is superior to other prevailing methods, including the CD
combinational methods.